Abstract. Research on signed languages still strongly dissociates linguistic
issues related to phonological and phonetic aspects from gesture
studies for recognition and synthesis purposes. This paper focuses on the
imbrication of motion and meaning for the analysis, synthesis and eval- 
uation of sign language gestures. We discuss the relevance and interest 
of a motor theory of perception in sign language communication. Ac- 
cording to this theory, we consider that linguistic knowledge is mapped 
on sensory-motor processes, and propose a methodology based on the 
principle of a synthesis-by-analysis approach, guided by an evaluation 
process that aims to validate some hypotheses and concepts of this theory.
Examples from existing studies illustrate the different concepts and
provide avenues for future work. 

1 Introduction 

The ever growing use of gestures in advanced technologies, such as augmented 
or virtual reality environments, requires more and more understanding of the 
different levels of representation of gestures, from meanings to motion charac- 
terized by causal physical and biological phenomena. This is even more true 
for skilled and expressive gestures, or for communicative gestures such as sign 
language gestures, both involving high level semiotic and cognitive representa- 
tions, and requiring extreme rapidity, accuracy, and physical engagement with 
the environment. 

In this paper we highlight the tight connection between the high-level and 
low-level representations involved in signed languages. Signed languages are 
fully-formed languages that serve as a means of communication between deaf 
people. They are characterized by meanings: they have their own rules of com- 
positional elements, grammatical structure, and prosody; but they also include 
multimodal components that are put into action by movements. They are
indeed by essence multi-channel, in the sense that several modalities are
involved when performing motion: body, hands, facial expression, gaze direction,
acting independently but participating all together to convey meaningful and 
discriminative information. In signed language storytelling for example, facial 
expressions may be used to qualify actions, emotions, and enhance meaning. 

We focus on data-driven models, which are based on observations of real
signed language gestures, using captured motion or videos. Motion capture
allows us to find relevant representations that encode the main spatio-temporal
characteristics of gestures. In the same way, analyzing videos may lead to
annotations where significant labels indicate the morpho-syntactic nature of elements
composing gestures, and may constitute a starting point for determining pho- 
netic structures. By combining both pieces of information, motion capture data 
and videos, we may also extract accurate low and high level features that help 
to understand sign language gestures. We believe that data-driven methods, in- 
corporating constraints extracted from observations, significantly improve the 
quality and the credibility of the synthesized motion. To go further, we propose
a synthesis-by-analysis method, corrected by a perceptual evaluation loop, to
model the underlying mechanisms of signed language gesture production. 

In the remainder of the paper, we propose a guideline aiming at characterizing 
the role of sensory-motor information for signed language understanding and 
production, based on the motor theory of sign language perception. We then 
provide a general methodology for analyzing, synthesizing, and evaluating signed 
language gestures, where different sensory data are used to extract linguistic 
features and infer motor programs, and to determine the action to perform in a 
global action-perception loop. The different concepts and models are illustrated 
by related works, both from the points of view of signed language linguistics and 
movement science communities. 

After describing related works in the next section, we propose sign language 
production and perception models underlying the motor theory of sign language 
perception. A methodology is then proposed to highlight how this theory may 
be exploited in both theoretical sign language research and motion sciences. 

2 Related Works 

There are two main approaches in modeling and producing sign language ges- 
tures, that are addressed differently in the different research communities: the 
first one, addressed by the signed language linguists, concerns the formation of 
the meaning from observations; the second one, addressed by motion science 
researchers, is related to motion generation and recognition from high-level sign 
descriptions. Most of the time, these two approaches are considered separately, 
as the two research communities do not share the same tools and methods. 

Linguistic researchers work on signed languages from some observation of nat- 
ural utterances, most often through video data: they build theories describing 
the mapping between these observations and linguistic components (phonetics, 
phonological structures, etc.). The resulting models are still widely debated in 
the sign language community, and usually, motion characterization is not seen as 
a prime objective for elaborating a phonological model [1] or a phonetic model [5].
In order to validate their observations and analysis, they need better knowledge 
of movement properties: kinematic invariants within signs and between signs, 
physical constraints, etc. Invariant laws in movements are discussed in [3]. 



Movement researchers on the other hand (bio-mechanicians, neuroscientists, 
computer animators, or roboticians) try to build simulation models that imitate 
real movements. Their approach consists, from high-level descriptions (plan- 
ning), of specifying a sequence of actions as a procedural program. They need 
to acquire better knowledge of the rules governing the system behavior, such 
as syntactic rules or parameterization of the sign components according to the 
discourse context. The next problem consists of interpreting these rules using 
specific computer languages (from scripting languages to procedural or reactive 
languages), and translating them into the sensory-motor processes underlying the
physical system that produces movement.

Most of the works in this area focus on the expressivity of the high-level 
computer languages, using descriptive or procedural languages, for example the 
XML-based specification language called SiGML [4], which is connected to the
HamNoSys [S] notation system, and interpreted into signed language gestures 
using classical animation techniques. A more exhaustive overview of existing 
systems using virtual signers technology can be found in [B]. For these kinds 
of applications involving signed language analysis, recognition, translation, and 
generation, the nature of the performed gestures themselves is particularly chal- 
lenging. 

Alternatively, data-driven animation methods can be substituted for these 
pure synthesis methods. In this case the motions of a real signer are captured 
with different combinations of motion capture techniques. Though these methods 
significantly improve the quality and credibility of animations, there are nonethe- 
less several challenges to the reuse of motion capture data in the production of 
sign languages. Some of them are related to the spatialization of the content, but 
also to the rapidity and precision required in motion performances, and to the 
dynamic aspects of movements. All these factors are responsible for phonological 
inflection processes. Incorrectly manipulated, they may lead to imperfections in 
the performed signs (problems in timing variations or synchronization between 
channels) that can alter the semantic content of the sentence. A detailed discus- 
sion on the important factors for the design of virtual signers in regard to the 
animation problems is proposed in [7]. 

Little has been done so far to determine the role of sensory-motor activity 
for the understanding (perception and production) of signed languages. The 
idea that semantic knowledge is embodied into sensory-motor systems has given 
rise to many studies, bringing together researchers from domains as different as 
cognitive neuroscience and linguistics, but most of these works concern spoken 
languages. This interaction between language and action is based on different
claims such as: 

— imagining and acting share the same neural substrate [5]; 

— language makes use in large part of brain structures akin to those used to 
support perception and action 0. 

Among these recent research interests, some researchers share the idea that 
motor production is necessarily involved in the recognition of sensory (audio,
visual, etc.) encoded actions; this idea echoes what is called the motor theory of
speech perception, which holds that the listener recognizes speech by activating
the motor programs that would produce sounds like those that are being heard
[10]. Within this theory, sensory data are auditory or visual cues (mouth
opening), and the motor actions are vocal gestures (movements of the vocal tract,
tongue, lips, etc.). 

This theory can be easily transposed to sign languages, and we will call it 
the Motor Theory of Sign Language Perception. In this case too, the linguis- 
tic information is embodied into sensory-motor processes, where sensory data 
may be visual clues (iconic gestures, classifiers), or perception of action (contact 
between several body parts, velocity or acceleration characteristics, etc.). 

3 The Motor Theory of Sign Language Perception 

All the evidence briefly reported in the previous section tends to show that 
perception and production of language utterances are closely related. It remains 
to describe or model this relationship. In the light of this evidence, the motor
theory of speech perception, which states that what we perceive is nothing but
the movement of the articulatory system (body movements), suggests that part
of conceptual and language structures is encoded at the motor program level,
e.g. as a sequence of motor actions that produce the desired sensory (or
perceptive) effect.

Similarly to the motor theory of speech perception, the motor theory of sign 
language perception that we promote in this paper claims that what we perceive 
is the movement of body articulators, and that the encoding and decoding of 
linguistic information should be partly addressed at the motor program level
characterizing the movement intention. Furthermore, if we accept the idea that the
motor program level is where the linguistic cues are encoded, then the motor
theory of perception leads us to consider that we can infer motor programs from
observed sensory cues only (motor act). We call this inference an inversion process
since its purpose is to deduce the cause from the consequence (sensory
observation).

Therefore, if we go further in the modeling of these concepts, we assume 
that the motor theory of sign language perception is based on two inversion 
mechanisms, one for sign language production, and the other one for sign lan- 
guage perception. These mechanisms will be used as part of encoding and de- 
coding processes of linguistic units. By linguistic units we mean here phonetic 
and phonological elements specific to sign languages. 

The first inversion process for sign language production, also called encoding 
process, is represented in Figure 1. It is a closed-loop system, where the signer
uses sensory information to produce the desired actions corresponding to a spe- 
cific motor program. The signer performing gestures perceives the environment 
through many sensory cues: he can view his interlocutor, and also the entities
positioned in the signing space (spatial targets); he may also capture auditory,
tactile (perception of touch), proprioceptive (perception of muscles and
articulations), and kinesthetic cues (perception of velocity, acceleration, etc.) from his
own body movements. These sensory cues are then inverted to provide motor
commands that modify the current action applied to the musculoskeletal system.
When producing sign language gestures, the linguistic information is also
exploited to generate a sign language utterance which is then translated into a
motor program.

Fig. 1. Sign language production: encoding from motor program and linguistic
information

In the context of sign language synthesis, the motor programs may be repre- 
sented by a sequence of goals, as for example key postures of the hand, or targets 
in hand motion, or facial expression targets. These targets are then interpreted 
into continuous motion, through an inverse kinematics or dynamics solver.
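
As an illustration of how a sequence of targets may be turned into continuous motion, the following minimal sketch interpolates between successive wrist targets and solves an analytic inverse kinematics problem at each step. It is only a toy example under strong assumptions: a planar two-segment arm, hypothetical segment lengths (0.30 m and 0.25 m), and linear interpolation between targets. The function names are ours and do not refer to any solver cited in this paper.

```python
import math

def two_link_ik(x, y, l1=0.30, l2=0.25):
    """Return (shoulder, elbow) angles in radians placing the wrist at (x, y).

    Planar two-segment arm; l1 and l2 are hypothetical segment lengths.
    """
    d2 = x * x + y * y
    # The law of cosines gives the cosine of the elbow angle.
    c = (d2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if c < -1.0 - 1e-9 or c > 1.0 + 1e-9:
        raise ValueError("target out of reach")
    c = max(-1.0, min(1.0, c))  # clamp to absorb rounding errors
    elbow = math.acos(c)
    shoulder = math.atan2(y, x) - math.atan2(
        l2 * math.sin(elbow), l1 + l2 * math.cos(elbow)
    )
    return shoulder, elbow

def forward(shoulder, elbow, l1=0.30, l2=0.25):
    """Forward kinematics, used here only to check the IK solution."""
    x = l1 * math.cos(shoulder) + l2 * math.cos(shoulder + elbow)
    y = l1 * math.sin(shoulder) + l2 * math.sin(shoulder + elbow)
    return x, y

def targets_to_motion(targets, steps=10):
    """Interpolate linearly between targets and solve IK at every step."""
    poses = []
    for (x0, y0), (x1, y1) in zip(targets, targets[1:]):
        for k in range(steps):
            t = k / steps
            poses.append(two_link_ik(x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    poses.append(two_link_ik(*targets[-1]))
    return poses

# Three hypothetical hand targets in the signing space (metres).
motion = targets_to_motion([(0.3, 0.2), (0.4, 0.1), (0.2, 0.3)])
```

A realistic solver would handle a full three-dimensional skeleton, joint limits, and dynamics; the sketch only shows the principle of mapping discrete goals to a continuous joint trajectory.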

The second inversion process used for gesture perception, also called decoding 
process, is represented in Figure 2. From the observation of a signer, it consists
in extracting multi-sensory cues, and then simultaneously inferring motor
programs (which make it possible to reproduce the detected sensory cues) and extracting linguistic
information.

Our approach to sign language perception can be divided into two kinds 
of analysis studies. The first one consists of a linguistic analysis that tries to 
extract phonetic or phonological features from the observation of signers. The 
second one consists in finding invariants or motor schemes in the data, upon
which one can build linguistic knowledge.

This last approach, inspired by the neuroscience community, may exploit
statistical tools in order to extract some regular features or schemes embedded
in motion data.




Fig. 2. Sign language perception: decoding for inferring motor program and 
extracting linguistic information 

4 Methodology: Sign Language Production and 
Perception 

In practice, production and perception are closely linked in a language com- 
munication context. In order to study jointly both mechanisms, we propose a 
general and experimental methodology based on an analysis (perception) /
synthesis (production) approach, depicted in Figure 3. It contains the following
three building blocks.

— i) The Analysis block refers to the perception or decoding aspect of the 
methodology. It uses observed information from simultaneously captured 
motion data and videos. It is based on hypotheses related to the
linguistically encoded structure of signs, and the motor programs underlying the
performed gestures. In practice, given the different nature of the information
that should be encoded (symbolic and numerical), it is more efficient to
process and store data in two different structures, namely a semantic database
for linguistic annotations, and a raw database for motion capture data;

— ii) The Synthesis block covers the production or encoding aspect of the 
methodology. It is composed of a sensory-motor animation system which 
uses both a scripting process expressing a new utterance and the
corresponding motor program that uses pre-recorded motion chunks. Moreover,
a 3D rendering engine makes it possible to visualize the virtual signer performing the
signs;

— iii) The Evaluation block makes it possible to evaluate the analysis
hypotheses, in the light of the synthesized gestures. Deaf experts or sign
language signers may indeed qualify the different performances (quality of the
gestures, realism, understandability), and propose some changes to the
models and sub-segment structures including motor program schemes. We
conjecture that during evaluation, based on their own sensory-motor inversion
loop, experts or signers are implicitly able to validate or invalidate the
synthesized motor performance and subsequently the hypotheses that have been
made for the elaboration of the motor programs.

Fig. 3. Analysis, synthesis, and evaluation methodology



This analysis-by-synthesis methodology requires bringing together researchers
from different communities. Preliminary work has been undertaken on the basis
of the data collected within the SignCom project [H]. Some models and results
underlying this methodology are presented below, in the context of the
analysis of French sign language corpora, and data-driven synthesis. The use of 3D
avatars driven by semantic and raw motion databases also allows us to go beyond
the restrictions of videos, and to evaluate the feasibility and understandability
of the models.



Corpus and database The observational data are composed of 50 minutes of
sign language motion capture data, recorded with 43 body
markers, 41 facial markers, and 12 hand markers, and videos of the same
sequences recorded with one camera. Some of the challenges posed by the corpus
creation and the capture of heterogeneous data flows are detailed in [T^ and [TH] .
It should be noted that the choice of the corpus (choice of the thematics, lim- 
ited vocabulary, choice of lexical and non-lexical signs, motion forms, etc.) may 
potentially influence the analysis and synthesis processes. 

Analysis The previous corpus has been analyzed and indexed by sign language 
experts: we separated the linguistic indexing from the raw motion indexing. 

— The linguistic indexing is provided by annotations performed by sign
language linguists associated with deaf people. Signs are generally decomposed
into various components, such as location, handshape, and movement, as
proposed by Stokoe [17]. Since then, other linguists have expanded and
modified Stokoe's decompositional system, introducing wrist orientation, syllabic
patterning, etc. [2]. However, signed languages are not restricted to
conveying meaning via the configuration and motion of the hand, but instead
involve the simultaneous use of both manual and non-manual components. 
The manual components of signed language include hand configuration, ori- 
entation, and placement or movement, expressed in the signing space (the 
physical three-dimensional space in which the signs are performed). Non- 
manual components consist of the posture of the upper torso, head orienta- 
tion, facial expression, and gaze direction. 

Following this structural description of signs, we annotate the selected cor- 
pus, identifying each sign type found in the video data with a unique gloss 
so that each token of a single type can be easily compared, and segmenting 
the different tiers composing the signs by exploiting grammatical and
phonological models [2]. The structure of the annotation scheme is characterized
by:

• a spatial structure, defined by several tiers and a structural organization 
by gathering several channels; 

• a temporal structure, resulting from manual and semi-automatic seg- 
mentation, allowing transitions / strokes labelling; 

• a manual labeling with elements and patterns borrowed from linguists; 
we have followed the phonetics model of Johnson and Liddell [5]. 

This annotation scheme makes it possible to match motion data and phonetic
structure, as shown in Figure 4, thus providing ways to index the
motion synchronously to the phonetic tiers.

— The motion indexing is based on motion processing. Sign language data 
have already been studied, following different approaches. We first identified 
phonological items, described as sequences of motion targets and handshape
targets [12], and used motor control rules to
produce realistic hand motion.

Using motion captured data from French sign language corpora, we have 
also developed specific analysis methods that have led to the extraction 
of low-level or high-level motor schemes. We first automatically segmented
handshape sequences [18], or hand movements that may be correlated to
motor control laws [19]. Secondly, statistical analyses have been conducted
to characterize the phasing between hand motion and handshape [20], or to
categorize hand motion velocity profiles within signs or during transitions
between signs [21] (controlled, ballistic, and inverse-ballistic movements).
Similar works have been carried out to show the temporal organization in
Cued Speech production [22].
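
To make the idea of kinematic segmentation more concrete, the sketch below cuts a hand trajectory at local minima of its tangential speed. This is only one common heuristic, given here as an assumption: the segmentation criteria actually used in the cited works are not detailed in this paper, and the function names, the sampling period, and the threshold value are hypothetical.

```python
import math

def tangential_speed(positions, dt):
    """Finite-difference speed of a list of (x, y, z) samples spaced dt seconds apart."""
    return [math.dist(p, q) / dt for p, q in zip(positions, positions[1:])]

def segment_at_speed_minima(speed, threshold):
    """Return the indices of local speed minima lying below `threshold`.

    Each returned index is a candidate boundary between two movement
    segments (e.g. between a stroke and a transition).
    """
    cuts = []
    for i in range(1, len(speed) - 1):
        is_local_min = speed[i] <= speed[i - 1] and speed[i] < speed[i + 1]
        if is_local_min and speed[i] < threshold:
            cuts.append(i)
    return cuts

# Hypothetical speed profile (m/s) with two clear pauses.
profile = [0.1, 0.5, 0.9, 0.4, 0.05, 0.3, 0.8, 0.2, 0.02, 0.4]
boundaries = segment_at_speed_minima(profile, threshold=0.1)
```

On real motion capture data, the speed signal would normally be low-pass filtered first, and the threshold tuned per signer and per channel.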

We also implemented a two-level indexed database (semantic and raw data)
[25]. From such a database, it will be possible to go further in the statistical
analysis, and thus extract other invariant features and motor schemes, and to use
them for re-synthesis.
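
The principle of a two-level indexed database can be sketched as follows. This is a hypothetical, minimal data structure and not the implementation described in [25]: the class names, tier names, frame rate, and gloss are assumptions chosen for illustration. The semantic level stores time-stamped annotations, the raw level stores motion frames, and a query on the semantic level returns the corresponding motion chunk.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    tier: str     # hypothetical tier name, e.g. "gloss" or "handshape"
    label: str    # annotation value on that tier
    start: float  # start time in seconds
    end: float    # end time in seconds

@dataclass
class TwoLevelDatabase:
    frame_rate: float                                # frames per second
    frames: list = field(default_factory=list)       # raw level: one pose per frame
    annotations: list = field(default_factory=list)  # semantic level

    def find(self, tier, label):
        """Semantic query: all annotations with this label on this tier."""
        return [a for a in self.annotations if a.tier == tier and a.label == label]

    def motion_chunk(self, annotation):
        """Raw query: the frames covered by an annotation's time interval."""
        first = round(annotation.start * self.frame_rate)
        last = round(annotation.end * self.frame_rate)
        return self.frames[first:last]

# Stand-in data: integers play the role of captured poses.
db = TwoLevelDatabase(frame_rate=100.0, frames=list(range(500)))
db.annotations.append(Annotation("gloss", "HOUSE", 1.0, 1.5))
chunk = db.motion_chunk(db.find("gloss", "HOUSE")[0])
```

Keeping the two levels separate, linked only by time stamps, lets the symbolic annotations and the numerical data evolve independently, which matches the separation between linguistic indexing and raw motion indexing described above.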

Synthesis Conversely, using these enriched databases to produce new
utterances from the corpus data remains challenging regarding the hypotheses derived
from the analysis processes. Different factors may be encoded into the motor
program driving the synthesis engine, such as the dynamics of the gestures (velocity
profiles, etc.), the synchronization between the channels, or the coarticulation 
effects by using the sequence of pre-identified targets. 
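
As a simple illustration of how pre-recorded motion chunks may be assembled into a new utterance, the sketch below concatenates two chunks with a linear cross-fade over a few overlapping frames. It is given under stated assumptions only: poses are plain lists of joint values, blending is linear, and the function name is ours. It should not be read as the transition or coarticulation model of the animation system referred to in this paper.

```python
def crossfade(chunk_a, chunk_b, overlap):
    """Concatenate two motion chunks, blending `overlap` frames linearly.

    Each chunk is a list of poses; each pose is a list of floats
    (hypothetical joint values with identical ordering in both chunks).
    """
    if overlap < 1 or overlap > min(len(chunk_a), len(chunk_b)):
        raise ValueError("overlap must fit inside both chunks")
    head = chunk_a[:-overlap]   # frames of chunk_a kept unchanged
    tail = chunk_b[overlap:]    # frames of chunk_b kept unchanged
    blended = []
    for k in range(overlap):
        w = (k + 1) / (overlap + 1)  # weight of chunk_b, strictly between 0 and 1
        pose_a = chunk_a[len(chunk_a) - overlap + k]
        pose_b = chunk_b[k]
        blended.append([(1 - w) * a + w * b for a, b in zip(pose_a, pose_b)])
    return head + blended + tail

# Two hypothetical one-joint chunks of four frames each.
sign_a = [[0.0], [0.0], [0.0], [0.0]]
sign_b = [[1.0], [1.0], [1.0], [1.0]]
utterance = crossfade(sign_a, sign_b, overlap=2)
```

Linear blending of joint values is a simplification; rotations are usually interpolated with quaternions, and the timing of each channel may need to be adjusted separately to preserve synchronization.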

The multichannel animation system for producing utterances signed in French 
Sign Language (LSF) by a virtual character is detailed in [6] . 

Evaluation Concerning evaluation issues, the idea is not so much to evaluate 
the signing avatar, but to evaluate the different hypotheses related to the de- 
coding of signs, from the observation of sign language performances, and to the 
corresponding encoding of signs within the synthesis system. With this
analysis-by-synthesis approach it is possible to refine the different hypotheses
and to better understand the coupled production-perception mechanisms.

Currently, the research community focuses on the usability of the avatar. 
The evaluation process can be divided into two processes: i) the evaluation of 
the acceptability of the avatar, which can be measured by human-likeness,
coordination, fluidity, realism of the three-dimensional rendering; ii) the evaluation
of the understandability of the avatar, which requires the recognition of signs
by measuring the precision of the signs, the co-articulation effects, etc., and
measuring the level of recognition of the sentences and the story. A
preliminary evaluation has been performed in [B]. Understanding and characterizing more
thoroughly the production and perception of sign language in the context of a
motor theory of perception is a natural and promising perspective that should
be pursued in the near future.



5 Conclusion 



This paper promotes a motor theory of sign language perception as a guideline 
for the understanding of linguistic encoding and decoding of sign language ges- 
tures. According to this theory, what we perceive is nothing but the movement of 
the body's articulators. In other words, this assumption states that the linguistic 
knowledge is mapped onto sensory-motor processes. Such an a priori statement 
relies on two main hypotheses: firstly, we are able to infer motor data from
sensory data through a sensory-motor inversion process, and secondly, elements of
linguistic information are somehow encoded into motor programs. A
methodology straightforwardly derived from these two hypotheses and based on a so-called
analysis-by-synthesis loop is detailed. This loop, through a perceptive evaluation
carried out by sign language experts, makes it possible to validate or invalidate hypotheses
on linguistic encoding at motor program levels. Although much work remains
to be done to validate the methodology and the motor theory of sign language
perception itself, its feasibility and practicality have been demonstrated in the
context of French sign language corpora analysis and data-driven synthesis.

It should be noted that the study of sign languages is a favorable field for 
validating motor theories of perception, since it is rather easy to infer the ar- 
ticulators' movements from sensory data (captured data and videos). However, 
this promising interdisciplinary research orientation requires the involvement of 
sign language linguists, deaf signers, neuroscientists and computer scientists. 